PMCSched: Scheduling algorithms made easy
PMCSched was born as a continuation of PMCTrack, an OS-oriented performance monitoring tool for Linux. The PMCSched framework takes PMCTrack's potential one step further by enabling rapid development of OS-level scheduling and resource-management support for Linux within a loadable kernel module. The diagram below illustrates how PMCSched fits into the PMCTrack design.
PMCSched is built on top of PMCTrack, so the first step is installing and building PMCTrack. Instructions for that can be found here. Reading those instructions is highly encouraged, since they include insights on using PMCTrack from user space, from the OS scheduler, and through PMCTrack monitoring modules.
Once PMCTrack is installed, we can go ahead and create our custom scheduling algorithm. A new scheduling or resource-management algorithm is implemented as a scheduling plugin. We start by defining the new plugin in PMCSched's main header, pmcsched.h: for example, a new plugin with ID SCHED_EXAMPLE, added to the enum of policies. Our new plugin also has to be included in the array of available schedulers, in that same header.
pmcsched.h
/* Supported Scheduling policies */
typedef enum {
    (...)
    SCHED_EXAMPLE,
    (...)
    NUM_SCHEDULERS /* Others in the future... */
} sched_policy_mm_t;
extern struct sched_ops example_plugin;
static __attribute__((unused)) struct sched_ops* available_schedulers[NUM_SCHEDULERS] = {
    (...)
    &example_plugin,
};
After that, we can start developing our new plugin right away, in a new file example.c. Creating a plugin works much like writing a Linux kernel module: the plugin follows a "contract" and implements a number of functions that every plugin should have, each with a specific signature and intended to handle a specific event.
example.c
The various algorithm-specific operations are invoked from the core part of the scheduling framework when a key scheduling-related event occurs, such as when a thread enters the system, terminates, becomes runnable/non-runnable, or when tick processing is due to update statistics. The framework also provides a set of callbacks to carry out periodic scheduling activations from interrupt (timer) and process (kernel thread) context on each core group separately, thus making it possible to invoke a wide range of blocking and non-blocking scheduling-related kernel API calls, such as those to map a thread to a specific CPU or core group. This modular approach to creating scheduling algorithms resembles the one used by scheduling classes (algorithms) inside the Linux kernel, but with a striking advantage: PMCSched scheduling plugins can be bundled in a kernel module that can be loaded on unmodified kernels.
Let's go ahead and prepare the basic events that our scheduling plugin should have logic to react to. Besides the required ones, there are also some events that our plugin can optionally track. The minimum required functions, together with a policy ID, optional flags, and a string description, go into our new file:
sched_ops_t example_plugin = {
    .policy      = SCHED_EXAMPLE,
    .description = "Example plugin",
    .flags       = NULL,
    .sched_kthread_periodic = sched_kthread_periodic_example,
    .on_active_thread       = on_active_thread_example,
    .on_inactive_thread     = on_inactive_thread_example,
    .on_exit_thread         = on_exit_thread_example,
};
Some of the fields of our plugin are conceived to ease development from user space. PMCSched exposes an entry of the /proc filesystem that includes information on the available schedulers, their descriptions, and their IDs. Verbose mode can also be enabled or disabled by echoing to these configuration files, and some places within PMCSched print meaningful information when verbose mode is active. In general, PMCSched uses trace_printk() to output insightful debugging information, but plugins can and should check active_scheduler_verbose before printing information with printk.
Once we have defined and added our new plugin, we can start developing the functions for these cases:
/* =======================================================================
 * Example scheduling plugin for PMCSched
 * Author: Carlos Bilbao
 * ======================================================================= */
#include <pmcsched.h>

static void sched_kthread_periodic_example(sized_list_t* migration_list) { }
static void on_active_thread_example(pmcsched_thread_data_t* t) { }
static void on_inactive_thread_example(pmcsched_thread_data_t* t) { }
static void on_exit_thread_example(pmcsched_thread_data_t* t) { }
We can now start implementing the logic for our example plugin. Let us, for example, implement something that resembles a Round Robin approach.
When the task becomes active, the function on_active_thread_example() will be called. We can insert it into global linked lists that keep track of active, stopped, or all threads. If the task that just became active was already in the global queue of active threads, nothing has to be done. However, if it was in the global stopped-threads queue, or in no queue whatsoever, it will have to wait for its turn to be moved into the active threads, so it should be stopped. Further, the plugin cannot assume the task is not in a signal-pending state (signal states are a complex topic that would require a separate discussion).
static void on_active_thread_example(pmcsched_thread_data_t* t)
{
    sized_list_t* signal_list = &pmcsched_gbl->pending_signals_list;

    if (t->state != ACTIVE_QUEUE && !signal_pending_th(t)) {
        /* Insert structure in per-application and global lists */
        if (t->state == NO_QUEUE) {
            insert_sized_list_tail(&t->app->app_stopped_threads,t);
            insert_sized_list_tail(&pmcsched_gbl->stopped_threads,t);
            insert_sized_list_tail(&pmcsched_gbl->all_threads,t);

            if (active_scheduler_verbose)
                printk(KERN_INFO "Newly active task %d will be stopped -- RR scheduler.\n",
                       t->prof->this_tsk->pid);
        } else {
            if (wrong_state_th(t))
                printk(KERN_ALERT "Thread %d is in a non-valid state!!\n",
                       t->prof->this_tsk->pid);

            if (active_scheduler_verbose)
                printk(KERN_INFO "Task %d in stopped queue became runnable\n",
                       t->prof->this_tsk->pid);
        }

        t->state = STOP_PENDING;
        insert_sized_list_tail(signal_list,t);

        /* Check if it became an inactive application and update consequently */
        check_inactive_app(t->app);
    }
}
When the task becomes inactive, function on_inactive_thread_example() will be called. The logic will resemble the activation case, but inverted.
static void on_inactive_thread_example(pmcsched_thread_data_t* t)
{
    sized_list_t *app_active_threads  = &t->app->app_active_threads;
    sized_list_t *app_stopped_threads = &t->app->app_stopped_threads;
    sized_list_t *gbl_active_threads  = &pmcsched_gbl->active_threads;
    sized_list_t *gbl_stopped_threads = &pmcsched_gbl->stopped_threads;

    /* The recently inactive task will not already be in the stopped queue */
    if (t->state != STOP_QUEUE && !signal_pending_th(t)) {
        if (t->state == ACTIVE_QUEUE) {
            /* Move the structure from the active per-application and global
               lists to the stopped per-application and global lists */
            switch_queues(app_active_threads,app_stopped_threads,t);
            switch_queues(gbl_active_threads,gbl_stopped_threads,t);

            if (active_scheduler_verbose)
                printk(KERN_INFO "Task %d went idle. Moved to stopped list by RR scheduler.\n",
                       t->prof->this_tsk->pid);
        } else {
            if (wrong_state_th(t))
                printk(KERN_ALERT "Thread %d is in a non-valid state!!\n",
                       t->prof->this_tsk->pid);

            insert_sized_list_tail(gbl_stopped_threads,t);
            insert_sized_list_tail(app_stopped_threads,t);

            if (active_scheduler_verbose)
                printk(KERN_INFO "Newly inactive task %d added to stopped by RR scheduler.\n",
                       t->prof->this_tsk->pid);
        }

        t->state = STOP_QUEUE;
        t->runnable = 0;

        /* Check if it became an inactive application */
        check_inactive_app(t->app);
    }
}
When the periodic kthread is triggered, the function sched_kthread_periodic_example() will be called. This is the function that actually implements the scheduling algorithm, using the global linked lists and changing which threads are allowed to run at a given time. First, we check whether there are any stopped threads (otherwise there is no point in stopping running ones):
void sched_kthread_periodic_example(sized_list_t* migration_list)
{
    sized_list_t* signal_list = &pmcsched_gbl->pending_signals_list;
    static int schedulings = 1, cpu_prev = -1, cpu;
    pmcsched_thread_data_t *activated, *stopped, *elem;
    int task = 0;

    if (likely(sized_list_length(&pmcsched_gbl->stopped_threads))) {
        elem = head_sized_list(&pmcsched_gbl->stopped_threads);

        activated = NULL;
        for (; activated == NULL &&
               task < sized_list_length(&pmcsched_gbl->stopped_threads) &&
               elem != NULL; task++) {
            if (elem->runnable && !signal_pending_th(elem)) {
                activated = elem;
                break;
            }
            elem = next_sized_list(&pmcsched_gbl->stopped_threads,elem);
        }
        if (activated) {
            switch_queues(&activated->app->app_stopped_threads,
                          &activated->app->app_active_threads,activated);
            switch_queues(&pmcsched_gbl->stopped_threads,
                          &pmcsched_gbl->active_threads,activated);
            activated->state = ACTIVE_PENDING;

            /* Add it to the list of threads to be woken up */
            insert_sized_list_tail(signal_list,activated);

            if (active_scheduler_verbose)
                printk(KERN_INFO "Task %d will be made active -- RR scheduler.\n",
                       activated->prof->this_tsk->pid);

            /* Check if it became an active application */
            check_active_app(activated->app);
        }
We check that the list of active threads is longer than one, as we don't want to kick the recently added thread:
        if (likely(sized_list_length(&pmcsched_gbl->active_threads) > 1)) {
            stopped = head_sized_list(&pmcsched_gbl->active_threads);

            if (!signal_pending_th(stopped)) {
                switch_queues(&stopped->app->app_active_threads,
                              &stopped->app->app_stopped_threads,stopped);
                switch_queues(&pmcsched_gbl->active_threads,
                              &pmcsched_gbl->stopped_threads,stopped);
                stopped->state = STOP_PENDING;

                /* Add it to the list of threads to be stopped */
                insert_sized_list_tail(signal_list,stopped);

                if (active_scheduler_verbose)
                    printk(KERN_INFO "Task %d will be stopped -- RR scheduler\n",
                           stopped->prof->this_tsk->pid);

                /* Check if it became an inactive application */
                check_inactive_app(stopped->app);
            }
        }
    }
    /* Set per-CPU mask if there are at least two cores */
    if (num_online_cpus() >= 2 && sized_list_length(&pmcsched_gbl->active_threads)) {
        cpu = !(schedulings % 5) ? 1 : 0; /* core 1 if multiple of 5 */
        elem = head_sized_list(&pmcsched_gbl->active_threads);

        /* Update the mask only if needed */
        if ((cpu != cpu_prev) ||
            (task_cpu(elem->prof->this_tsk) != 0 &&
             task_cpu(elem->prof->this_tsk) != 1)) {
            /* Important: we need an auxiliary migration list because we
               can't invoke sched_setaffinity while holding the lock */
            for (task = 0; task < sized_list_length(&pmcsched_gbl->active_threads) &&
                 elem != NULL; task++) {
                /* Check if it needs to be masked */
                if (task_cpu(elem->prof->this_tsk) != cpu) {
                    cpumask_clear(elem->mask);
                    cpumask_set_cpu(cpu, elem->mask);
                    insert_sized_list_tail(migration_list,elem);

                    if (active_scheduler_verbose)
                        printk(KERN_INFO "CPU-mask for task %d updated now (cpu = %d). - RR\n",
                               elem->prof->this_tsk->pid,cpu);
                }
                elem = next_sized_list(&pmcsched_gbl->active_threads,elem);
            }
        }
        cpu_prev = cpu;
    }

    schedulings = (schedulings == INT_MAX) ? 0 : schedulings + 1;
}
Finally, we add the new source file to the corresponding Makefile:
src/modules/pmcs/intel-core/Makefile
MODULE_NAME=mchw_intel_core

obj-m += $(MODULE_NAME).o

PMCSCHED-objs= example_plugin.o \
               (...)
Arguably, one of PMCSched's coolest features is its ability to collect information regarding Performance Monitoring Counters (PMCs), using the APIs provided by PMCTrack.
You can configure your plugin to collect certain metrics and events, such as instruction count, cycles, LLC misses, and LLC references. This is particularly useful for profiling applications as they enter the system.
static metric_experiment_set_t metric_description = {
    .nr_exps = 1, /* 1 set of metrics */
    .exps = {
        /* Metric set 0 */
        {
            .metrics = {
                PMC_METRIC("IPC",op_rate,INSTR_EVT,CYCLES_EVT,1000),
                PMC_METRIC("RPKI",op_rate,LLC_ACCESSES,INSTR_EVT,1000000),
                PMC_METRIC("MPKI",op_rate,LLC_MISSES,INSTR_EVT,1000000),
                PMC_METRIC("MPKC",op_rate,LLC_MISSES,CYCLES_EVT,1000000),
                PMC_METRIC("STALLS_L2",op_rate,STALLS_L2_MISS,CYCLES_EVT,1000),
            },
            .size = NR_METRICS,
            .exp_idx = 0,
        },
    }
};
enum event_indexes {
    INSTR_EVT = 0,
    CYCLES_EVT,
    LLC_ACCESSES,
    LLC_MISSES,
    STALLS_L2_MISS,
    L2_LINES,
    PMC_EVENT_COUNT
};

enum metric_indices {
    IPC_MET = 0,
    RPKI_MET,
    MPKI_MET,
    MPKC_MET,
    STALLS_L2_MET,
    NR_METRICS,
};
The counter configuration goes into a pmcsched_counter_config_t structure, whose profiling mode can be set to TBS_SCHED_MODE (time-based sampling) or EBS_SCHED_MODE (event-based sampling):
static pmcsched_counter_config_t cconfig = {
    .counter_usage = {
        .hwpmc_mask          = 0x3b, /* bitops -h 0,1,3,4,5 */
        .nr_virtual_counters = CMT_MAX_EVENTS,
        .nr_experiments      = 1,
        .vcounter_desc       = {"llc_usage","total_llc_bw","local_llc_bw"},
    },
    .pmcs_descr     = &pmc_configuration,
    .metric_descr   = {&metric_description,NULL},
    .profiling_mode = TBS_SCHED_MODE,
};
We then hook this configuration into the plugin through its counter_config field, and register a callback such as profile_thread_example() to be invoked on every new sample:
sched_ops_t example_pmcs_plugin = {
    .policy         = SCHED_EXAMPLE_PMCs,
    .description    = "Plugin that uses PMCs",
    .counter_config = &cconfig,
    (...)
    .on_new_sample  = profile_thread_example,
};
static int profile_thread_example(pmon_prof_t* prof, int cpu, pmc_sample_t* sample,
                                  int flags, void* data)
{
    pmcsched_thread_data_t* t = prof->monitoring_mod_priv_data;
    (...)
    t->instr_counter += sample->pmc_counts[0];
    return 0;
}